48 Coresets and Sketches
Abstract
Geometric data summarization has become an essential tool both in geometric approximation algorithms and in settings where geometry intersects with big data problems. In linear or near-linear time, large data sets can be compressed into a summary, and then more intricate algorithms can be run on the summaries, with results that approximate those on the full data set. Coresets and sketches are the two most important classes of these summaries. A coreset is a reduced data set which can be used as a proxy for the full data set; the same algorithm can be run on the coreset as on the full data set, and the result on the coreset approximates that on the full data set. It is often required or desired that the coreset be a subset of the original data set, but in some cases this is relaxed. A weighted coreset is one where each point is assigned a weight, perhaps different from the one it had in the original set. A weak coreset associated with a set of queries is one where the error guarantee holds for a query which (nearly) optimizes some criterion, but not necessarily for all queries; a strong coreset provides error guarantees for all queries. A sketch is a compressed mapping of the full data set onto a data structure which is easy to update with new or changed data, and which allows certain queries whose results approximate queries on the full data set. A linear sketch is one where the mapping is a linear function of each data point, thus making it easy for data to be added, subtracted, or modified. These definitions can blend together, and some summaries can be classified as either or both. The overarching connection is that the summary size will ideally depend only on the approximation guarantee and not on the size of the original data set, although in some cases logarithmic dependence is acceptable.
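To make the linearity property concrete, the following is a minimal Python sketch of one possible linear sketch: a fixed random matrix of +/-1 signs applied to a vector. The class name `LinearSketch`, its methods, and the parameters `dim`, `size`, and `seed` are hypothetical names chosen for illustration; they do not come from the chapter. Because the summary is a linear function of the data, subtracting an earlier contribution leaves the same summary as if that contribution had never been added.

```python
import random


class LinearSketch:
    """Random-sign linear sketch of a dim-dimensional vector (illustrative only)."""

    def __init__(self, dim, size, seed=0):
        rng = random.Random(seed)
        # Fixed random +/-1 matrix S with `size` rows and `dim` columns.
        # The sketch stored in self.values is always S applied to the vector x.
        self.signs = [
            [rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(size)
        ]
        self.values = [0.0] * size

    def update(self, index, delta):
        """Apply x[index] += delta; only column `index` of S is needed."""
        for row, sign_row in enumerate(self.signs):
            self.values[row] += sign_row[index] * delta

    def squared_norm_estimate(self):
        """Average of squared sketch entries; its expectation is ||x||^2."""
        return sum(v * v for v in self.values) / len(self.values)


# Sketch `a` sees an insertion, another insertion, then a deletion of the first.
a = LinearSketch(dim=8, size=64, seed=1)
a.update(2, 5.0)
a.update(5, -1.0)
a.update(2, -5.0)  # subtract the earlier contribution

# Sketch `b` (same seed, hence same matrix S) sees only the surviving update.
b = LinearSketch(dim=8, size=64, seed=1)
b.update(5, -1.0)

same_summary = a.values == b.values
```

Two sketches built with the same seed share the same matrix, so their `values` lists can also be added entry by entry to obtain the sketch of the sum of the two underlying vectors. This is one reason linear sketches suit streaming and distributed settings.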
We focus on five types of coresets and sketches: shape-fitting (Section 48.1), density estimation (Section 48.2), high-dimensional vectors (Section 48.3), high-dimensional point sets / matrices (Section 48.4), and clustering (Section 48.5). There are many other types of coresets and sketches (e.g., for graphs [AGM12] or Fourier transforms [IKP14]) which we do not cover due to space limitations or because they are less geometric.